Automatic Corpus Extension for Data-driven Natural Language Generation

Authors

Elena Manishina

Bassam Jabaian

Stéphane Huet

Fabrice Lefèvre

Abstract

As data-driven approaches started to make their way into the Natural Language Generation (NLG) domain, the need to automate corpus building and extension became apparent. Corpus creation and extension in the data-driven NLG domain traditionally involved manual paraphrasing, performed either by a group of experts or through crowdsourcing. Building training corpora manually is a costly enterprise that requires a lot of time and human resources. We propose to automate the process of corpus extension by integrating automatically obtained synonyms and paraphrases. Our methodology allowed us to significantly increase the size of the training corpus and its level of variability (the number of distinct tokens and specific syntactic structures). Our extension solutions are fully automatic and require only some initial validation. The human evaluation results confirm that in many cases native users favor the outputs of the model built on the extended corpus.
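To make the idea of synonym-based corpus extension concrete, here is a minimal Python sketch. It is not the authors' pipeline, which the abstract does not describe in detail. It assumes a small hand-written synonym table (hypothetical), replaces one token at a time, and skips the validation step the abstract mentions. The helper names `extend_corpus` and `distinct_tokens` are illustrative only.

```python
from typing import Dict, List


def extend_corpus(corpus: List[str], synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original sentences plus single-substitution variants.

    `synonyms` is a hypothetical lookup table (lowercased token -> alternatives);
    a real system would obtain it automatically and validate it first.
    """
    extended = list(corpus)
    seen = set(corpus)
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            for alt in synonyms.get(token.lower(), []):
                # Swap in one synonym at position i, keep the rest unchanged.
                variant = " ".join(tokens[:i] + [alt] + tokens[i + 1:])
                if variant not in seen:
                    seen.add(variant)
                    extended.append(variant)
    return extended


def distinct_tokens(corpus: List[str]) -> int:
    """Count distinct lowercased tokens, a rough proxy for lexical variability."""
    return len({tok.lower() for sentence in corpus for tok in sentence.split()})


if __name__ == "__main__":
    seed = ["the hotel is cheap"]
    table = {"cheap": ["inexpensive", "affordable"], "hotel": ["inn"]}
    bigger = extend_corpus(seed, table)
    for sentence in bigger:
        print(sentence)
    print(distinct_tokens(seed), "->", distinct_tokens(bigger))
```

Substituting a single token per variant keeps each new sentence close to a known-good one. That is a design choice of this sketch, not something the abstract states.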

Similar resources

Automatic Generation of a Multi Agent System for Crisis Management by a Model Driven Approach

Considering the increasing occurrences of unexpected events and the need for pre-crisis planning in order to reduce risks and losses, modeling instant response environments is needed more than ever. Modeling may lead to more careful planning for crisis-response operations, such as team formation, task assignment, and doing the task by teams. A common challenge in this way is that the model shou...

Evaluating a dialog language generation system: comparing the mountain system to other NLG approaches

This paper describes the MOUNTAIN language generation system, a fully-automatic, data-driven approach to natural language generation aimed at spoken dialog applications. MOUNTAIN uses statistical machine translation techniques and natural corpora to generate human-like language from a structured internal language, such as a representation of the dialog state. We briefly describe the training pr...

Concordance-Based Data-Driven Learning Activities and Learning English Phrasal Verbs in EFL Classrooms

In spite of the highly beneficial applications of corpus linguistics in language pedagogy, it has not found its way into mainstream EFL. The major reasons seem to be the teachers’ lack of training and the unavailability of resources, especially computers in language classes. Phrasal verbs have been shown to be a problematic area of learning English as a foreign language due to their semantic op...

Neural Sentence Ordering

Sentence ordering is a general and critical task for natural language generation applications. Previous works have focused on improving its performance in an external, downstream task, such as multi-document summarization. Given its importance, we propose to study it as an isolated task. We collect a large corpus of academic texts, and derive a data driven approach to learn pairwise ordering of...

Automatic Tweet Generation From Traffic Incident Data

We examine the use of traffic information with other knowledge sources to automatically generate natural language tweets similar to those created by humans. We consider how different forms of information can be combined to provide tweets customized to a particular location and/or specific user. Our approach is based on data-driven natural language generation (NLG) techniques using corpora conta...

Publication date: 2016
